Improved Estimation of Entropy for Evaluation of Word Sense Induction
Abstract
Information-theoretic measures are among the most standard techniques for evaluating clustering methods, including word sense induction (WSI) systems. Such measures rely on sample-based estimates of entropy. However, the standard maximum likelihood estimates of entropy are heavily biased, and the bias depends on, among other things, the number of clusters and the sample size. This makes the measures unreliable and unfair when the number of clusters produced by different systems varies and the sample size is not exceedingly large. This corresponds exactly to the setting of WSI evaluation, where a ground-truth number of senses (clusters) arguably does not exist and the standard evaluation scenarios use a small number of instances of each word to compute the score. We describe more accurate entropy estimators and analyze their performance both in simulations and on evaluation of WSI systems.
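The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method: the abstract does not name the estimators it proposes, so the Miller-Madow correction (a standard textbook adjustment) stands in here as an assumed example of a less biased estimator. The function names, the uniform cluster distribution, and the sample size of 30 are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def ml_entropy(labels):
    """Plug-in (maximum likelihood) entropy estimate, in nats."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(labels):
    """ML estimate plus the Miller-Madow term (observed clusters - 1) / (2n)."""
    n = len(labels)
    k_observed = len(set(labels))
    return ml_entropy(labels) + (k_observed - 1) / (2 * n)


def mean_estimates(num_clusters, sample_size, trials=2000, seed=0):
    """Average both estimators over samples drawn uniformly from the clusters."""
    rng = random.Random(seed)
    ml_total = 0.0
    mm_total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(num_clusters) for _ in range(sample_size)]
        ml_total += ml_entropy(sample)
        mm_total += miller_madow_entropy(sample)
    return ml_total / trials, mm_total / trials


if __name__ == "__main__":
    # True entropy of a uniform distribution over k clusters is log(k).
    for k in (2, 5, 10):
        ml_mean, mm_mean = mean_estimates(k, sample_size=30)
        print(f"k={k:2d} true={math.log(k):.3f} "
              f"ML={ml_mean:.3f} Miller-Madow={mm_mean:.3f}")
```

With the sample size held fixed, the average ML estimate falls further below the true entropy as the number of clusters grows, which is why systems producing different numbers of clusters are not scored on an equal footing.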
Similar resources
An Improved Big Bang-Big Crunch Algorithm for Estimating Three-Phase Induction Motors Efficiency
Nowadays, most generated electrical energy is consumed by three-phase induction motors. Thus, in order to carry out preventive measurements and maintenance, and eventually to employ high-efficiency motors, the efficiency evaluation of induction motors is vital. In this paper, a novel and efficient method based on the Improved Big Bang-Big Crunch (I-BB-BC) Algorithm is presented for efficiency e...
Word Sense Disambiguation of Ambiguous Persian Words with the LDA Topic Model
Word sense disambiguation is the task of identifying the correct sense of a word in a given context among a finite set of possible senses. In this paper, a model for Farsi word sense disambiguation is presented. The model uses two groups of features: first, all words and stop words around the target word, and second, topic models. We extract topics from a Farsi corpus with Latent Dirichlet ...
Graph Connectivity Measures for Unsupervised Parameter Tuning of Graph-Based Sense Induction Systems.
Word Sense Induction (WSI) is the task of identifying the different senses (uses) of a target word in a given text. This paper focuses on the unsupervised estimation of the free parameters of a graph-based WSI method, and explores the use of eight Graph Connectivity Measures (GCM) that assess the degree of connectivity in a graph. Given a target word and a set of parameters, GCM evaluate the co...
Conditional Structure versus Conditional Estimation in NLP Models
This paper separates conditional parameter estimation, which consistently raises test set accuracy on statistical NLP tasks, from conditional model structures, such as the conditional Markov model used for maximum-entropy tagging, which tend to lower accuracy. Error analysis on the POS tagging task shows that the actual tagging errors made by the conditionally structured model derive not only f...
SemEval-2010 Task 14: Evaluation Setting for Word Sense Induction & Disambiguation Systems
This paper presents the evaluation setting for the SemEval-2010 Word Sense Induction (WSI) task. The setting of the SemEval-2007 WSI task consists of two evaluation schemes, i.e. unsupervised evaluation and supervised evaluation. The first one evaluates WSI methods in a similar fashion to Information Retrieval exercises using F-Score. However, F-Score suffers from the matching problem which doe...
Journal title:
- Computational Linguistics
Volume 40
Publication date: 2014